Create a new markdown document to answer questions from this lab, and save it in a sensible place on your computer. If you select the “From Template” option there is a “Lab Report” template that will work very nicely for you!
In this lab you will use the ggplot2 package to generate graphics. “The Grammar of Graphics” is the theoretical basis for the ggplot2 package. Much like how we construct sentences in any language by using a linguistic grammar (nouns, verbs, etc.), the grammar of graphics allows us to specify the components of a statistical graphic.
In short, the grammar tells us that:
A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.
We can break a graphic into the following three essential components:
data: the data set containing the variables that we plot
geom: the type of geometric object we see in our plot (points, lines, bars, etc.)
aes: the aesthetic attributes of the geometric object that we can perceive on a graphic, for example x/y position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data set.
Remember: Unless I indicate “text only”, you are expected to include all code you need to answer questions. In other words, if you are asked a question about a graphic, include the code for that graphic.
Resources: You will probably want to refer to the R for Data Science data visualization chapter.
We will use the following packages, which are all contained in the tidyverse package. The tidyverse package is actually a collection of packages that share an underlying common philosophy of coding and “tidy” data. Today we are using the following:
dplyr: for data wrangling
ggplot2: for data visualization
readr: for reading in data
Remember to use the library command to load the tidyverse and openintro packages into your environment.
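As a minimal sketch, loading both packages could look like the following. It assumes both packages are already installed (for example with install.packages()).

```r
# Load the tidyverse collection (includes dplyr, ggplot2, and readr)
library(tidyverse)

# Load openintro, which provides the ncbirths data used in this lab
library(openintro)
```

Put these lines in a code chunk near the top of your document so that they run before any plotting code when you knit.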
Today we will practice data visualization using data on births from
the state of North Carolina. Copy, paste and run the code below to load
the data into your workspace (i.e. using the console). These data are
found in the openintro package.
data(ncbirths)
The data set that shows up in your Environment is a large data frame. Each observation or case is a birth of a single child.
The data command instructs R to load some data built into a package. The workspace environment in the upper right hand corner of the RStudio window should now list a data set called ncbirths with 1000 observations (rows or cases) and 13 variables (columns).
You can see the dimensions of this data frame (# of rows and columns), the names of the variables, the variable types, and the first few observations using the glimpse function.
There are LOTS of ways to get a sense of your data; glimpse is just one of them. You’ll pick up more, and identify your favorites, as you get more practice.
glimpse(ncbirths)
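A few other functions give a similar overview. The sketch below uses only base R functions (dim, names, head) and loads ncbirths directly from openintro so that it runs on its own. It is one possible selection, not a complete list.

```r
# Load the data set directly from the openintro package
data(ncbirths, package = "openintro")

# Number of rows and columns
dim(ncbirths)

# Variable (column) names
names(ncbirths)

# First six observations
head(ncbirths)
```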
We can see that there are 1000 observations and 13 variables in this data set. The variable names are fage, mage, mature, etc. This output also tells us that some variables are numbers: some specifically integers <int>, others numbers with decimals <dbl>. Some of the variables are factors <fct> (categories). It is a good practice to see if R is treating variables as factors <fct>; as numbers <int> or <dbl> (basically numbers with decimals); or as characters (i.e. text) <chr>. We can change these types if we don’t like them.
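As a hedged illustration of changing a type, the sketch below converts the mature column to text using mutate() from dplyr and as.character() from base R. The name births_chr is a hypothetical object introduced only for this example; the original ncbirths is left unchanged.

```r
library(dplyr)
data(ncbirths, package = "openintro")

# Convert mature to a character (text) column, whatever type it was stored as
births_chr <- ncbirths %>%
  mutate(mature = as.character(mature))

# Check the type of the converted column
class(births_chr$mature)
```

Related base R functions such as as.factor() and as.numeric() follow the same pattern.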
What variable type does R consider habit to be? What variable type is visits? (answer with text only, but use the glimpse function in the console to get the answer)
You can view the data by clicking on the name ncbirths
in the Environment pane (upper right window). This will bring
up an alternative display of the data set in the Data Viewer
(upper left window). R has stored these data in a kind of spreadsheet
called a data frame. Each row represents a different birth: the
first entry or column in each row is simply the row number (it’s a
different color), the rest are the different variables that were
recorded for each birth. You can close the data viewer by clicking on
the x in the upper left hand corner.
It is a good idea to try knitting your document from time to time as you go along! Go ahead, and make sure your document is knitting. Note that knitting automatically saves your .Rmd file, too.
We will explore three different types of graphs initially.
scatterplots
boxplots
histograms
Scatterplots allow you to investigate the relationship between two numerical variables. While you may already be familiar with this type of plot, let’s view it through the lens of the Grammar of Graphics. Specifically, we will graphically investigate the relationship between the following two numerical variables in the births data frame:
weeks: length of a pregnancy on the horizontal “x” axis, and
weight: birth weight of a baby in pounds on the vertical “y” axis
ggplot(data = ncbirths, aes(x = weeks, y = weight)) +
geom_point()
Let’s view this plot through the grammar of graphics. Within the ggplot() function call, we specified:
The data frame ncbirths, by setting data = ncbirths
The aesthetic mapping, which determines the visuals of the plot, with aes(x = weeks, y = weight)
The variable weeks maps to the x-position aesthetic
The variable weight maps to the y-position aesthetic
We also add a layer to the ggplot() function call using the + sign. The layer in question specifies the geometric object here as points, by specifying geom_point().
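To see the layering idea in isolation, here is a small sketch that builds the same scatterplot in two steps. The object names base_plot and scatter_plot are hypothetical, chosen only for this illustration.

```r
library(ggplot2)
data(ncbirths, package = "openintro")

# Base plot: data and aesthetic mapping only, no geometric object yet
base_plot <- ggplot(data = ncbirths, aes(x = weeks, y = weight))

# Adding a layer with + supplies the geometric object (points)
scatter_plot <- base_plot + geom_point()

# Printing the object draws the plot
scatter_plot
```

Printing base_plot on its own shows empty axes, because no geom has been added to draw the data.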
Finally, we can also add axis labels and a title to the plot like so.
Again we add a new layer, this time a labs or labels layer.
ggplot(data = ncbirths, aes(x = weeks, y = weight)) +
geom_point() +
labs(x = "Length of pregnancy (in weeks)", y = "Birth weight of baby (lbs)",
title = "Relationship between pregnancy duration and newborn weight")
Is there a positive or negative relationship between these variables? (text only to answer)
Make a graph showing weeks again on the x axis and the variable gained on the y axis (the amount of weight a mother gained during pregnancy). Include axis labels with measurement units, and a title.
Study the code below, and the resulting graphical output. Note that I added a new argument of color = premie inside the aesthetic mapping. The variable premie indicates whether a birth was early (premie) or went full term. Please answer with text:
A. What did adding the argument color = premie accomplish?
B. How many variables are now displayed on this plot?
C. What appears to (roughly) be the pregnancy length cutoff for classifying a newborn as a “premie” versus a “full term”?
ggplot(data = ncbirths, aes(x = weeks, y = gained, color = premie))+
geom_point() +
labs(x = "Pregnancy length (wks)", y = "Maternal weight gain (lbs)")
Make a new scatterplot with a mother’s age on the x axis (mage) and birth weight of newborns on the y axis (weight). Color the points on the plot based on the gender of the resulting baby (variable called gender). Does there appear to be any strong relationship between a mother’s age and the weight of her newborn? Does the sex of the child seem to be a factor?
Histograms are useful plots for showing how many elements of a single numerical variable fall in specified bins. This is a very useful way to get a sense of the distribution of your data. Histograms are often one of the first steps in exploring data visually.
For instance, to look at the distribution of pregnancy duration (variable called weeks):
ggplot(data = ncbirths, aes(x = weeks))+
geom_histogram()
A few things to note here:
There is only one variable mapped in aes(): the single numerical variable weeks. You don’t need to compute the y-aesthetic: R calculates it automatically.
The geometric object is set with geom_histogram().
We can change the binwidth (and thus the number of bins), as well as the colors.
ggplot(data = ncbirths, aes(x = weeks))+
geom_histogram(binwidth = 1, color = "white", fill = "steelblue")
Note that none of these arguments went inside the aesthetic mapping argument as they do not specifically represent mappings of variables to visual properties.
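The difference between setting a visual property and mapping a variable to it can be sketched side by side. The object names hist_set and hist_map are hypothetical, and using premie for the fill is just one possible choice for illustration.

```r
library(ggplot2)
data(ncbirths, package = "openintro")

# Setting: fill is one fixed color, given outside aes()
hist_set <- ggplot(data = ncbirths, aes(x = weeks)) +
  geom_histogram(binwidth = 1, color = "white", fill = "steelblue")

# Mapping: fill depends on the variable premie, given inside aes()
hist_map <- ggplot(data = ncbirths, aes(x = weeks, fill = premie)) +
  geom_histogram(binwidth = 1, color = "white")

hist_set
hist_map
```

In the second plot ggplot2 picks the fill colors and adds a legend, because the colors now carry information about a variable.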
Inspect the histogram of the weeks variable. Answer each of the following with text only.
A. The y axis is labeled count. What is specifically being counted in this case? Hint: think about what each case is in this data set.
B. What appears to be roughly the average length of pregnancies in weeks?
C. If we changed the binwidth to 100, how many bins would there be? Roughly how many cases would be in each bin?
Make a histogram of the birth weight of newborns (which is in lbs), including a title and axis labels.
Faceting is used when we’d like to create small multiples of the same plot over a different categorical variable. By default, all of the small multiples will have the same vertical axis.
For example, suppose we were interested in looking at whether pregnancy length varied by the maturity status of a mother (column name mature). This is what is meant by “the distribution of one variable over another variable”: weeks is one variable and mature is the other variable. In order to look at histograms of weeks for younger and more mature mothers, we add a plot layer facet_wrap(~ mature, ncol = 1). The ncol = 1 argument just tells R to stack the two histograms into one column.
ggplot(data = ncbirths, aes(x = weeks)) +
geom_histogram(binwidth = 1, color = "white", fill = "steelblue") +
facet_wrap(~ mature, ncol = 1)
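When one group has far fewer cases than the other, the shared vertical axis can make the smaller group's histogram hard to read. As a sketch of one option, facet_wrap() accepts a scales argument, and scales = "free_y" lets each panel use its own vertical axis. The object name facet_free is hypothetical.

```r
library(ggplot2)
data(ncbirths, package = "openintro")

# Same faceted histogram, but each panel gets its own y axis
facet_free <- ggplot(data = ncbirths, aes(x = weeks)) +
  geom_histogram(binwidth = 1, color = "white", fill = "steelblue") +
  facet_wrap(~ mature, ncol = 1, scales = "free_y")

facet_free
```

Free axes make shapes easier to compare, but bar heights no longer compare directly across panels, so check the axis values before drawing conclusions.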
Make a histogram of newborn birth weight (weight) split by gender of the child. Set the binwidth to 0.5. Which gender appears to have a slightly larger average birth weight?
While histograms can help to show the distribution of data, boxplots have much more flexibility, and can provide even more information in a single graph. The y aesthetic is the numeric variable you want to include in the boxplot, and the x aesthetic is a grouping variable. For instance, below we set gender as the aesthetic mapping for x, and gained as the aesthetic mapping for y. This creates a boxplot of the weight gained for mothers that had male and female newborns. Note that the fill argument is not necessary, but sets a color for the boxplots.
ggplot(data = ncbirths, aes(x = gender, y = gained)) +
geom_boxplot(fill = "sienna")
Take some time to familiarize yourself with the different parts of the boxplot:
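The box is drawn from three numbers per group: the first quartile, the median, and the third quartile. The whiskers reach to the most extreme points within 1.5 times the interquartile range (IQR) of the box, and points beyond that are drawn individually. As a rough sketch, the box values for the plot above can be computed with dplyr; the object name box_stats and its column names are hypothetical.

```r
library(dplyr)
data(ncbirths, package = "openintro")

# Quartiles of maternal weight gain for each gender, ignoring missing values
box_stats <- ncbirths %>%
  group_by(gender) %>%
  summarize(
    q1  = quantile(gained, 0.25, na.rm = TRUE),  # bottom edge of the box
    med = median(gained, na.rm = TRUE),          # line inside the box
    q3  = quantile(gained, 0.75, na.rm = TRUE),  # top edge of the box
    iqr = IQR(gained, na.rm = TRUE)              # height of the box
  )

box_stats
```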
Make a boxplot of the weight gained by moms, split by the maturity status of the mothers (mature). Include axis labels and a title on your plot. Is the median weight gain during pregnancy larger for younger or older moms?
Make a boxplot of pregnancy duration in weeks by smoking habit. Is the duration of pregnancy more variable for smokers or non-smokers? (i.e. which group has the greater spread for the variable weeks?). What do you think the “NA” means?
For the following, you need to determine which type of plot to use, make the plot, and answer any questions with text. The last few exercises require you to duplicate plots I’ve given using a few other datasets. The ggplot2 cheatsheet is a great guide to choosing the right plot (Help|Cheatsheets|Data visualization with ggplot2). It’s split by the number of variables in your plot as well as the type of variables.
Using a data visualization, visually assess: Is the variable for father’s age (fage) symmetrical, or does it have a skew?
A. Using a data visualization, visually assess: (in this sample) is the median birth weight of babies greater for white or non-white mothers (variable called whitemom)? Why do you think there is an “NA” group?
B. Discuss (in your group and in your report) whether you view any ethical dilemmas with the variable whitemom. Some questions to consider: how was this information originally collected? Would findings related to this variable be able to help all women or just some women?
Using a data visualization, visually assess: (in this sample) as a mother’s age (mage) increases, does the duration of pregnancy (weeks) appear to decrease? Hint: Try using geom_jitter to see more of the data. You don’t need to set any arguments for this.
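To show what jittering does, here is a sketch on a different pair of variables, the parents' ages (mage and fage). It assumes ages are recorded in whole years, so many births land on exactly the same point. The object names plot_points and plot_jitter are hypothetical.

```r
library(ggplot2)
data(ncbirths, package = "openintro")

# Plain points: births with identical ages are drawn on top of each other
plot_points <- ggplot(data = ncbirths, aes(x = mage, y = fage)) +
  geom_point()

# Jittered points: each point is moved by a small random amount
plot_jitter <- ggplot(data = ncbirths, aes(x = mage, y = fage)) +
  geom_jitter()

plot_points
plot_jitter
```

Because the shifts are random, the jittered plot looks slightly different each time it is drawn.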
Recreate the plots below…
mpg dataset in the tidyverse package.
economics dataset in the tidyverse package. There is info you need in the caption.
diamonds dataset in the tidyverse package. This one is harder. You need to look up how to make a density plot in ggplot2 (and what it is). There is info you need in the caption.
When you are finished with the lab, go to the very top and change the
output from html_document to pdf_document. The PDF document doesn’t look as nice, but it is easier to grade and upload to Schoology. Now turn in this PDF file to Schoology. Note the due date and time. If Schoology says it’s late, it’s late. Make sure your final Markdown document knits properly and shows all your work. Look through it to make sure everything looks organized and professional. Also remember that if you needed output (graphs, numeric output, etc.) to answer a question, the code to generate that output needs to be in the lab report. Other code should not be included.